06-11-2022

OBJECTIVE

While buying a house, the two most important questions a buyer may ask are: 'What is the price of this house?' and 'What features do I get for that price?'.

Based on these questions, it is natural to look for the features that affect house prices and to examine how they affect them.

Our primary objective here is to build a linear regression model for predicting house price.

DESCRIPTION OF THE DATASET

The Ames Housing dataset consists of 79 different explanatory variables and is enriched with data on almost every aspect of a house. These explanatory variables focus on the quality and quantity of many physical attributes of the property.

Predictor type                  Number
Continuous Variable             20
Discrete Variable               14
Categorical Variable (Nominal)  23
Categorical Variable (Ordinal)  23

Total number of observations: 1459

There are certain categorical predictors with many levels (e.g. 'Neighborhood' has 27 levels) as well as predictors with only 2 levels.

DESCRIPTION OF THE DATASET (Continued)

  • Quantitative: 1stFlrSF, 2ndFlrSF, 3SsnPorch, BedroomAbvGr, BsmtFinSF1, BsmtFinSF2, BsmtFullBath, BsmtHalfBath, BsmtUnfSF, EnclosedPorch, Fireplaces, FullBath, GarageArea, GarageCars, GarageYrBlt, GrLivArea, HalfBath, KitchenAbvGr, LotArea, LotFrontage, LowQualFinSF, MSSubClass, MasVnrArea, MiscVal, MoSold, OpenPorchSF, OverallCond, OverallQual, PoolArea, ScreenPorch, TotRmsAbvGrd, TotalBsmtSF, WoodDeckSF, YearBuilt, YearRemodAdd, YrSold

  • Qualitative: Alley, BldgType, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, BsmtQual, CentralAir, Condition1, Condition2, Electrical, ExterCond, ExterQual, Exterior1st, Exterior2nd, Fence, FireplaceQu, Foundation, Functional, GarageCond, GarageFinish, GarageQual, GarageType, Heating, HeatingQC, HouseStyle, KitchenQual, LandContour, LandSlope, LotConfig, LotShape, MSZoning, MasVnrType, MiscFeature, Neighborhood, PavedDrive, PoolQC, RoofMatl, RoofStyle, SaleCondition, SaleType, Street, Utilities.

DESCRIPTION OF THE DATASET (Continued)

Name           Description
SalePrice      Selling price of the house
MSSubClass     Identifies the type of dwelling involved in the sale
LotArea        Lot size in square feet
Neighborhood   Physical locations within Ames city limits
BldgType       Type of dwelling
OverallQual    Rates the overall material and finish of the house
YearBuilt      Original construction date
YearRemodAdd   Remodel date (same as construction date if no remodeling or additions)
BedroomAbvGr   Bedrooms above grade (does NOT include basement bedrooms)
SaleCondition  Condition of sale

Exploratory Data Analysis

Response Variable

Here our response is the price ('SalePrice'). It is a continuous variable, measured in US dollars.

A brief summary of the response is given below:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  135751  168703  179209  179184  186789  281644

Note that the distribution of the response is positively skewed. <>

A Description of some Predicting Features

0% 25% 50% 75% 100% Mean SD skewness kurtosis
LotFrontage 21.0 60.0 70.0 80.0 200 69.0 21.0 0.6 6.1
LotArea 1470.0 7391.0 9399.0 11517.5 56600 9819.2 4955.5 3.1 23.7
YearRemodAdd 1950.0 1963.0 1992.0 2004.0 2010 1983.7 21.1 -0.4 1.6
MasVnrArea 0.0 0.0 0.0 162.0 1290 99.7 177.0 2.5 11.4
BsmtFinSF1 0.0 0.0 350.0 752.0 4010 438.9 455.3 1.2 5.7
BsmtFinSF2 0.0 0.0 0.0 0.0 1526 52.6 176.7 4.0 20.6
BsmtUnfSF 0.0 219.0 460.0 797.5 2140 553.9 437.4 0.9 3.3
TotalBsmtSF 0.0 784.0 988.0 1304.0 5095 1045.4 443.6 0.8 8.2
1stFlrSF 407.0 873.5 1079.0 1382.5 5095 1156.5 398.2 1.6 11.0
2ndFlrSF 0.0 0.0 0.0 676.0 1862 326.0 420.6 0.9 2.7
LowQualFinSF 0.0 0.0 0.0 0.0 1064 3.5 44.0 16.2 310.6
GrLivArea 407.0 1117.5 1432.0 1721.0 5095 1486.0 485.6 1.1 5.9
BsmtFullBath 0.0 0.0 0.0 1.0 3 0.4 0.5 0.7 2.4
BsmtHalfBath 0.0 0.0 0.0 0.0 2 0.1 0.3 3.8 16.5
FullBath 0.0 1.0 2.0 2.0 4 1.6 0.6 0.3 2.8
HalfBath 0.0 0.0 0.0 1.0 2 0.4 0.5 0.7 2.0
BedroomAbvGr 0.0 2.0 3.0 3.0 6 2.9 0.8 0.4 4.7
KitchenAbvGr 0.0 1.0 1.0 1.0 2 1.0 0.2 4.1 20.4
TotRmsAbvGrd 3.0 5.0 6.0 7.0 15 6.4 1.5 0.8 4.5
Fireplaces 0.0 0.0 0.0 1.0 4 0.6 0.6 0.8 3.4
GarageYrBlt 0.0 1956.0 1977.0 2001.0 2207 1872.0 445.8 -3.9 16.7
GarageCars 0.0 1.0 2.0 2.0 5 1.8 0.8 -0.1 3.2
GarageArea 0.0 317.5 480.0 576.0 1488 472.4 217.3 0.3 4.0
WoodDeckSF 0.0 0.0 0.0 168.0 1424 93.2 127.7 2.1 13.2
OpenPorchSF 0.0 0.0 28.0 72.0 742 48.3 68.9 2.7 16.0
EnclosedPorch 0.0 0.0 0.0 0.0 1012 24.2 67.2 4.7 43.0
3SsnPorch 0.0 0.0 0.0 0.0 360 1.8 20.2 12.5 172.6
ScreenPorch 0.0 0.0 0.0 0.0 576 17.1 56.6 3.8 20.2
PoolArea 0.0 0.0 0.0 0.0 800 1.7 30.5 20.2 447.1
MiscVal 0.0 0.0 0.0 0.0 17000 58.2 630.8 20.1 472.9
SalePrice 135751.3 168703.0 179208.7 186789.4 281644 179183.9 16518.3 0.9 6.8
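The skewness and kurtosis columns above can be reproduced with short moment-based helpers. The sketch below is illustrative only (Python with NumPy; the analysis itself may have been done in R), and it assumes the table reports non-excess kurtosis, since roughly symmetric features in it sit near 3:

```python
import numpy as np

def skewness(x):
    """Third standardized moment (0 for a symmetric sample)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float((d ** 3).mean() / x.std() ** 3)

def kurtosis(x):
    """Fourth standardized moment, non-excess (about 3 for a normal sample)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float((d ** 4).mean() / x.std() ** 4)
```

Under this convention, a feature dominated by a few very large values (such as MiscVal) shows both large positive skewness and very large kurtosis, as in the table.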

Relationship Between Price and Other Features

<>

<>

<>

<>

<>

Analysis of Houses with respect to Age

Price Trend of Houses Built over Years <>

<>

Correlation Heatmap

<> Note that the price of a house is highly correlated with the features LotArea, GrLivArea, BedroomAbvGr and TotRmsAbvGrd.

<> Note that BedroomAbvGr and GrLivArea are linearly related: an increase in above-ground living area tends to come with more bedrooms above ground.

Data Cleaning

Missing Values Percentage

The dataset originally contains 79 predictors.

We drop the following columns:

Feature       Description
Id            Identification number of house
YearBuilt     Original construction date
YrSold        Year sold (YYYY)
YearRemodAdd  Remodel date (same as construction date if no remodeling or additions)
PoolQC        Quality of pool

We introduce two new predictor columns, 'Age_Sold_Group' and 'renov_bi':

  • 'Age_Sold_Group' = 'YrSold' - 'YearBuilt' (computed before the year columns are dropped)
  • 'renov_bi' takes the value 1 if the house was renovated and 0 otherwise.

Then we create dummy variables for all the categorical predictors.
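These feature-engineering steps (age at sale, a renovation flag, then dummies) can be sketched in Python with pandas on a few toy rows. The column names follow the dataset, but the example values are made up, and 'renovated' is assumed to mean that the remodel date differs from the build date:

```python
import pandas as pd

# Toy rows with the columns used in this step (values are illustrative)
df = pd.DataFrame({
    "YrSold":       [2008, 2007, 2010],
    "YearBuilt":    [2003, 1976, 2001],
    "YearRemodAdd": [2006, 1976, 2001],
    "BldgType":     ["1Fam", "2fmCon", "1Fam"],
    "SalePrice":    [208500, 181500, 223500],
})

# New predictors: age of the house when sold, and a renovation indicator
# (assuming 'renovated' means YearRemodAdd differs from YearBuilt)
df["Age_Sold_Group"] = df["YrSold"] - df["YearBuilt"]
df["renov_bi"] = (df["YearRemodAdd"] != df["YearBuilt"]).astype(int)

# Drop the raw year columns, then one-hot encode the categorical predictor
df = df.drop(columns=["YrSold", "YearBuilt", "YearRemodAdd"])
df = pd.get_dummies(df, columns=["BldgType"])
```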

Treatment of Null Values

Feature      Method
LotFrontage  Null values replaced by the median within the same neighborhood
MasVnrType   Null values replaced by mode
MSZoning     Null values replaced by mode
Functional   Null values replaced by mode
Exterior1st  Null values replaced by mode
Exterior2nd  Null values replaced by mode
Utilities    Null values replaced by mode
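As a sketch of these two imputation rules (Python/pandas on toy data; the actual code may differ), LotFrontage gets the median within its neighborhood and the categorical columns get the overall mode:

```python
import numpy as np
import pandas as pd

# Toy data with missing values (values are illustrative)
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "CollgCr", "CollgCr"],
    "LotFrontage":  [60.0, np.nan, 80.0, 70.0, np.nan],
    "MSZoning":     ["RL", "RL", np.nan, "RM", "RL"],
})

# LotFrontage: fill with the median within the same neighborhood
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)

# Categorical columns: fill with the overall mode
for col in ["MSZoning"]:
    df[col] = df[col].fillna(df[col].mode()[0])
```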

LOG TRANSFORMATION OF THE RESPONSE

Since the response is highly positively skewed, we apply a log transformation to it.

Histogram of response after log transformation:

<>
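The transformation itself is one line. Assuming the natural log was used (the report does not state the base), predictions on the log scale are mapped back to dollars with the exponential:

```python
import numpy as np

# Quartile values of SalePrice taken from the summary above
prices = np.array([135751.0, 168703.0, 179209.0, 186789.0, 281644.0])

log_prices = np.log(prices)      # the model is fit on this scale
recovered = np.exp(log_prices)   # back-transform predictions to dollars
```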

SPLITTING THE DATASET INTO TRAIN AND TEST SETS

We split the dataset into two parts: 80% of the data goes into the training set and the remaining 20% into the test set.
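An 80/20 split can be sketched with a seeded shuffle (plain NumPy; the function name is hypothetical, and a library helper such as scikit-learn's train_test_split would do the same job):

```python
import numpy as np

def train_test_split_80_20(X, y, seed=0):
    """Shuffle the rows and put 80% in the train set, 20% in the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(0.8 * len(y))
    tr, te = idx[:cut], idx[cut:]
    return X[tr], X[te], y[tr], y[te]
```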

Models for Prediction

Here we use a least squares linear regression model.

Response: log(‘SalePrice’)

Since the dataset contains a large number of predictors and exhibits multicollinearity, it is wise to remove predictors that do not have a significant impact on the response.

MODEL 1:

Recall that from the correlation heatmap we noticed that price is highly correlated with 'LotArea', 'BedroomAbvGr' and 'TotalBsmtSF'.

##                  Estimate   Std. Error    t value Pr(>|t|)
## (Intercept)  1.177705e+01 2.312430e-03 5092.93330        0
## LotArea      1.017735e-05 1.216701e-07   83.64706        0
## BedroomAbvGr 7.538112e-02 7.301208e-04  103.24472        0
  • Multiple R-squared: 0.9475
  • Residual Standard Error: 0.0204400
  • PRESS: 0.4926025

## 791 729 
## 623 575
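The PRESS statistic reported above is the sum of squared leave-one-out prediction errors. For least squares it can be computed from a single fit via the hat matrix, using the identity e_(i) = e_i / (1 - h_ii), without refitting n times. A NumPy sketch (function name hypothetical):

```python
import numpy as np

def press_statistic(X, y):
    """PRESS = sum of squared leave-one-out residuals,
    computed from the hat matrix of one OLS fit."""
    X = np.column_stack([np.ones(len(y)), X])   # add intercept column
    H = X @ np.linalg.solve(X.T @ X, X.T)       # hat matrix
    resid = y - H @ y                           # ordinary residuals
    return float(np.sum((resid / (1.0 - np.diag(H))) ** 2))
```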

Prediction over the Test Dataset

Implementing Different Variable Selection Methods

Here we compare two variable selection methods:

  • LASSO

  • Principal Component Analysis

Variable Selection: LASSO

Choice of Lambda

<>

Coefficients vs Fraction Deviance

<>

We see that just 2 non-zero coefficients in the model are enough to explain about 80% of the total deviance of the response.

From LASSO, we get the following predicting columns:

##  [1] "(Intercept)"          "LotFrontage"          "LotArea"             
##  [4] "YearRemodAdd"         "HalfBath"             "BedroomAbvGr"        
##  [7] "KitchenAbvGr"         "TotRmsAbvGrd"         "GarageCars"          
## [10] "EnclosedPorch"        "MSSubClass_30"        "MSSubClass_60"       
## [13] "MSSubClass_80"        "MSSubClass_90"        "MSSubClass_120"      
## [16] "MSSubClass_180"       "MSZoning_FV"          "MSZoning_RL"         
## [19] "Alley_Pave"           "LotShape_IR2"         "LotShape_IR3"        
## [22] "LotConfig_Inside"     "LandSlope_Mod"        "LandSlope_Sev"       
## [25] "Neighborhood_Blueste" "Neighborhood_BrkSide" "Neighborhood_CollgCr"
## [28] "Neighborhood_Crawfor" "Neighborhood_SWISU"   "Condition1_PosN"     
## [31] "Condition1_RRAe"      "Condition1_RRAn"      "BldgType_Duplex"     
## [34] "BldgType_Twnhs"       "BldgType_TwnhsE"      "HouseStyle_2.5Unf"   
## [37] "OverallQual_2"        "OverallCond_4"        "OverallCond_6"       
## [40] "OverallCond_7"        "Exterior2nd_Brk Cmn"  "Exterior2nd_BrkFace" 
## [43] "Exterior2nd_Plywood"  "Exterior2nd_Wd Shng"  "MasVnrType_BrkFace"  
## [46] "MasVnrType_Stone"     "Foundation_CBlock"    "Heating_Grav"        
## [49] "HeatingQC_Gd"         "HeatingQC_Po"         "Functional_Min1"     
## [52] "FireplaceQu_TA"       "GarageType_Attchd"    "GarageQual_Gd"       
## [55] "GarageCond_Po"        "PavedDrive_P"         "MoSold_2"            
## [58] "MoSold_3"             "MoSold_4"             "MoSold_5"            
## [61] "MoSold_6"             "MoSold_7"             "MoSold_8"            
## [64] "MoSold_9"             "MoSold_10"            "MoSold_11"           
## [67] "MoSold_12"            "SaleType_CWD"         "SaleType_New"        
## [70] "SaleType_Oth"
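LASSO shrinks coefficients toward zero and sets many exactly to zero, which is what produces the reduced predictor list above. A minimal coordinate-descent sketch of the idea (illustrative only, assuming standardized predictors and a centered response; the report presumably used a packaged solver such as glmnet):

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=500):
    """Coordinate-descent LASSO for (1/2n)||y - Xb||^2 + lam*||b||_1.
    Assumes the columns of X are standardized and y is centered."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]   # residual excluding feature j
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            # soft-thresholding: weakly correlated features are zeroed out
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return b
```

The soft-thresholding step is what performs variable selection: any feature whose partial correlation with the residual falls below lambda gets a coefficient of exactly zero.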

MODEL 2:

We use the predicting columns obtained from LASSO and fit a least square linear model.

We train our model using the training set and obtain the following results:

  • Multiple R-squared: 0.9585
  • Residual Standard Error: 0.0186
  • PRESS: 0.4374293

RESIDUAL DIAGNOSTICS

Residual Plot

<> We see that the residuals are more or less randomly scattered around the zero line.

Q-Q Plot

<>

The Q-Q plot shows that the residuals are not normally distributed.

Prediction over the Test Dataset

Density Plot Of House Price

<>

Principal Component Analysis

One way to reduce the number of predictor variables is Principal Component Analysis (PCA). Here, we transform the numerical predictors into an orthogonal set of predicting variables such that the new variables explain 95% of the total variation.

Through PCA we obtain 21 principal components that explain 95% of the total variation.
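Retaining the fewest components that reach 95% cumulative explained variance can be sketched via the SVD of the centered predictor matrix (NumPy; function name hypothetical, and in practice the predictors would also be scaled first):

```python
import numpy as np

def pca_95(X, threshold=0.95):
    """Project centered X onto the fewest principal components whose
    cumulative explained variance reaches the threshold."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = S ** 2 / (len(X) - 1)                     # variance per component
    cum = np.cumsum(var) / var.sum()                # cumulative proportion
    k = int(np.searchsorted(cum, threshold)) + 1    # smallest k reaching it
    return Xc @ Vt[:k].T, k
```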

Scree Plot

<>

MODEL 3:

We use the 21 principal components together with all the categorical features and fit a least squares linear model on the response variable.

We train our model using the training set and obtain the following results:

  • Multiple R-squared: 0.97210
  • Residual Standard Error: 0.01698
  • PRESS: 0.50811

RESIDUAL DIAGNOSTICS

Residual Plot

<>

We see that the residuals are more or less randomly scattered around the zero line.

Q-Q Plot

<>

The Q-Q plot shows that the distribution of the residuals is skewed.

Prediction over the Test Dataset

Density Plot of House Price

<>

A Summary

Measures                   Model_1        Model_2        Model_3
R-Squared                  0.9475000      0.9585000      0.9721000
Residual Standard Error    0.0204400      0.0186000      0.0169800
PRESS                      0.4926025      0.4374293      0.5081105
MAE                        2833.4170000   2665.2830000   3004.9760000

So, since Model 2 attains the lowest PRESS and MAE, we can say that the predictors chosen by LASSO give the best fit to the response variable.

<>